In this tutorial we will go through the basic workflow of training machine learning models for spatial mapping based on remote sensing. To do this we will look at two case studies located in the MarburgOpenForest in Germany: one has the aim to produce a land cover map including different tree species; the other aims at producing a map of Leaf Area Index.
Based on “default” models, we will further discuss the relevance of different validation strategies and the area of applicability.
For this tutorial we need the raster package for processing of the satellite data (note: needs to be replaced by terra soon) as well as the caret package as a wrapper for machine learning (here: randomForest) algorithms. Sf is used for handling of the training data available as vector data (polygons). Mapview is used for spatial visualization of the data. CAST will be used to account for spatial dependencies during model validation as well as for the estimation of the AOA.
rm(list=ls())
#major required packages:
library(raster)
library(caret)
library(mapview)
library(sf)
library(CAST)
#additional required packages:
library(tmap)
library(latticeExtra)
To start with, let’s load and explore the remote sensing raster data as well as the vector data that include the training sites.
mof_sen <- stack("data/sentinel_uniwald.grd")
print(mof_sen)
## class : RasterStack
## dimensions : 522, 588, 306936, 10 (nrow, ncol, ncell, nlayers)
## resolution : 10, 10 (x, y)
## extent : 474200, 480080, 5629540, 5634760 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs
## names : T32UMB_20170510T103031_B02, T32UMB_20170510T103031_B03, T32UMB_20170510T103031_B04, T32UMB_20170510T103031_B05, T32UMB_20170510T103031_B06, T32UMB_20170510T103031_B07, T32UMB_20170510T103031_B08, T32UMB_20170510T103031_B11, T32UMB_20170510T103031_B12, NDVI
## min values : 723.0000000, 514.0000000, 294.0000000, 341.8125000, 396.9375000, 440.8125000, 377.0000000, 159.6250000, 84.0625000, -0.2965204
## max values : 8.325000e+03, 9.087000e+03, 1.381000e+04, 7.368750e+03, 8.683812e+03, 9.602312e+03, 1.394100e+04, 1.289450e+04, 1.217225e+04, 8.702505e-01
The RasterStack contains a subset of the optical data from Sentinel-2 (see band information here: https://en.wikipedia.org/wiki/Sentinel-2) given in scaled reflectances (B02-B11). In addition,the NDVI was calculated. Let’s plot the rasterStack to get an idea how the variables look like.
plot(mof_sen)
plotRGB(mof_sen,r=3,g=2,b=1,stretch="lin")
The vector file is read as sf object. It contains the training sites that will be regarded here as a ground truth for the land cover classification.
trainSites <- read_sf("data/trainingsites_LUC.gpkg")
Using mapview’s viewRGB function we can visualize the aerial image channels as true color composite in the geographical context and overlay it with the polygons. Click on the polygons to see which land cover class is assigned to a respective polygon.
viewRGB(mof_sen, r = 3, g = 2, b = 1, map.types = "Esri.WorldImagery")+
mapview(trainSites)
In order to train a machine learning model between the spectral properties and the land cover class, we first need to create a data frame that contains the predictor variables at the location of the training sites as well as the corresponding class information. This data frame can be produced with the extract function. The resulting data frame contains the predictor variables for each pixel overlayed by the polygons. This data frame then still needs to be merged with the information on the land cover class from the sf object.
trainDat <- extract(mof_sen, trainSites, df=TRUE)
trainSites$PolygonID <- 1:nrow(trainSites)
trainDat <- merge(trainDat, trainSites, by.x="ID", by.y="PolygonID")
head(trainDat)
## ID T32UMB_20170510T103031_B02 T32UMB_20170510T103031_B03
## 1 1 839 798
## 2 1 835 773
## 3 1 833 785
## 4 1 842 832
## 5 1 838 781
## 6 1 837 778
## T32UMB_20170510T103031_B04 T32UMB_20170510T103031_B05
## 1 536 1077.875
## 2 490 1068.125
## 3 520 1062.562
## 4 531 1053.688
## 5 498 1043.562
## 6 506 1035.000
## T32UMB_20170510T103031_B06 T32UMB_20170510T103031_B07
## 1 2155.188 2472.625
## 2 2082.062 2417.375
## 3 2094.688 2445.000
## 4 2076.438 2408.188
## 5 2038.312 2380.562
## 6 2058.000 2406.938
## T32UMB_20170510T103031_B08 T32UMB_20170510T103031_B11
## 1 2261 1627.875
## 2 2200 1634.125
## 3 2467 1679.688
## 4 2709 1558.938
## 5 2478 1573.312
## 6 2305 1609.438
## T32UMB_20170510T103031_B12 NDVI id LN Type
## 1 814.6875 0.6167322 NA 5 Laerche
## 2 830.5625 0.6356877 NA 5 Laerche
## 3 841.1250 0.6518246 NA 5 Laerche
## 4 780.3125 0.6722222 NA 5 Laerche
## 5 790.4375 0.6653226 NA 5 Laerche
## 6 794.3125 0.6399858 NA 5 Laerche
## geom
## 1 MULTIPOLYGON (((477384.3 56...
## 2 MULTIPOLYGON (((477384.3 56...
## 3 MULTIPOLYGON (((477384.3 56...
## 4 MULTIPOLYGON (((477384.3 56...
## 5 MULTIPOLYGON (((477384.3 56...
## 6 MULTIPOLYGON (((477384.3 56...
For model training we need to define the predictor and response variables. As predictors we can use basically all information from the raster stack as we might assume they could all be meaningful for the differentiation between the land cover classes. As response variable we use the “Label” column of the data frame.
predictors <- names(mof_sen)
response <- "Type"
We then train a Random Forest model to lean how the classes can be distinguished based on the predictors (note: other algorithms would work as well. See https://topepo.github.io/caret/available-models.html for a list of algorithms available in caret). Caret’s train function is doing this job.
So let’s see how we can then train a “default” random forest model. We specify “rf” as method, indicating that a Random Forest is applied. We reduce the number of trees (ntree) to 75 to speed things up. Note that usually a larger number (>250) is appropriate.
model <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
ntree=75)
model
## Random Forest
##
## 2139 samples
## 10 predictor
## 10 classes: 'Buche', 'Duglasie', 'Eiche', 'Felder', 'Fichte', 'Laerche', 'Siedlung', 'Strasse', 'Wasser', 'Wiese'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2139, 2139, 2139, 2139, 2139, 2139, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8425545 0.8117284
## 6 0.8442552 0.8138404
## 10 0.8333590 0.8007179
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
To perform the classification we can then use the trained model and apply it to each pixel of the raster stack using the predict function.
prediction <- predict(mof_sen,model)
Then we can then create a map with meaningful colors of the predicted land cover using the tmap package.
cols <- rev(c("palegreen", "blue", "grey", "red", "lightgreen", "forestgreen", "beige","brown","darkgreen","yellowgreen"))
tm_shape(deratify(prediction)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
Based on this we can now discuss more advanced aspects of cross-validation for performance assessment as well as spatial variable selection strategies.
Before starting model trainign we can specify some control settings using trainControl. For hyperparameter tuning (mtry) as well as for error assessment we use a spatial 3-fold cross-validation. Her, the training data are split into 3 folds but data from the same polygon are always grouped so that they never occur in both, training and testing. Also we make sure that each fold contains data from each land cover class. CAST’s CreateSpacetimeFolds is doing this job when we specify the polygon ID and the class label.
indices <- CreateSpacetimeFolds(trainDat,spacevar = "ID",k=3,class="Type")
ctrl <- trainControl(method="cv",
index = indices$index,
savePredictions = TRUE)
Model training is then again performed using caret’s train function. However we use a wrapper around it that is selecting the predictor variables which are relevant for making predictions to new spatial locations (forward feature selection, fss). We use the Kappa index as metric to select the best model.
# train the model
set.seed(100)
model <- ffs(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl,
importance=TRUE,
ntree=75,
verbose=FALSE)
print(model)
## Random Forest
##
## 2139 samples
## 8 predictor
## 10 classes: 'Buche', 'Duglasie', 'Eiche', 'Felder', 'Fichte', 'Laerche', 'Siedlung', 'Strasse', 'Wasser', 'Wiese'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1385, 1435, 1458
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7251515 0.6742339
## 5 0.7262418 0.6753939
## 8 0.7153944 0.6619107
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
plot(varImp(model))
When we print the model (see above) we get a summary of the prediction performance as the average Kappa and Accuracy of the three spatial folds. Looking at all cross-validated predictions together we can get the “global” model performance.
# get all cross-validated predictions:
cvPredictions <- model$pred[model$pred$mtry==model$bestTune$mtry,]
# calculate cross table:
table(cvPredictions$pred,cvPredictions$obs)
##
## Buche Duglasie Eiche Felder Fichte Laerche Siedlung Strasse Wasser
## Buche 224 2 83 0 4 11 1 1 3
## Duglasie 4 122 9 0 18 0 0 0 1
## Eiche 83 2 84 0 14 0 2 2 0
## Felder 1 1 0 515 0 0 0 21 3
## Fichte 3 5 5 0 28 1 0 0 0
## Laerche 3 0 1 0 1 5 0 0 1
## Siedlung 1 0 2 20 0 0 84 21 7
## Strasse 0 0 0 29 0 0 21 104 10
## Wasser 4 3 1 19 7 0 0 7 131
## Wiese 5 0 0 52 5 1 0 21 3
##
## Wiese
## Buche 2
## Duglasie 0
## Eiche 0
## Felder 34
## Fichte 0
## Laerche 0
## Siedlung 1
## Strasse 15
## Wasser 5
## Wiese 260
prediction <- predict(mof_sen,model)
cols <- rev(c("palegreen", "blue", "grey", "red", "lightgreen", "forestgreen", "beige","brown","darkgreen","yellowgreen"))
tm_shape(deratify(prediction)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
We have seen that technically, the trained model can be applied to the entire area of interest (and beyond…as long as the sentinel predictors are available which they are, even globally). But we should assess if we SHOULD apply our model to the entire area. The model should only be applied to locations that feature predictor properties that are comparable to those of the training data. If dissimilarity to the training data is larger than the dissimmilarity within the training data, the model should not be applied to this location.
AOA <- aoa(mof_sen,model)
plot(AOA$AOA)
The result of the aoa function has two layers: the dissimilarity index (DI) and the area of applicability (AOA). The DI can take values from 0 to Inf, where 0 means that a location has predictor properties that are identical to properties observed in the training data. With increasing values the dissimilarity increases. The AOA has only two values: 0 and 1. 0 means that a location is outside the area of applicability, 1 means that the model is inside the area of applicability.
spplot(deratify(prediction), col.regions=cols,maxpixels=ncell(prediction))+
spplot(AOA$AOA,col.regions=c("grey","transparent"))
In the second example we will look at a regression task: We have point measurements of Leaf area index (LAI), and, based in this, we would like to make predictions for the entire forest. Again, we will use the Sentinel data as potnetial predictors.
mof_sen <- stack("data/sentinel_uniwald.grd")
LAIdat <- st_read("data/trainingsites_LAI.gpkg")
## Reading layer `trainingsites_LAI' from data source
## `/home/hanna/Documents/Github/HannaMeyer/EON2022/Machine_Learning_Session/data/trainingsites_LAI.gpkg'
## using driver `GPKG'
## Simple feature collection with 67 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 476350 ymin: 5631537 xmax: 478075 ymax: 5632765
## Projected CRS: WGS 84 / UTM zone 32N
mapview(LAIdat)
trainDat <- extract(mof_sen,LAIdat,df=TRUE)
trainDat$LAI <- LAIdat$LAI
As a simple first approach we might develop a linear model. Let’s assume a linear relationship between the NDVI and the LAI
plot(trainDat$NDVI,trainDat$LAI)
model_lm <- lm(LAI~NDVI,data=trainDat)
summary(model_lm)
##
## Call:
## lm(formula = LAI ~ NDVI, data = trainDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87314 -0.52143 -0.03363 0.63668 2.25252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8518 1.4732 -0.578 0.56515
## NDVI 6.8433 2.3160 2.955 0.00435 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8887 on 65 degrees of freedom
## Multiple R-squared: 0.1184, Adjusted R-squared: 0.1049
## F-statistic: 8.731 on 1 and 65 DF, p-value: 0.004354
abline(model_lm,col="red")
prediction_LAI <- predict(mof_sen,model_lm)
plot(prediction_LAI)
Let’s use the NNDM cross-validation approach.
nndm_folds <- nndm(LAIdat,mof_sen,min_train = 0.4)
Question?! Do you have an idea why somany training points excluded during model training?
Let’s explore via plot_geodist.
geodist <- plot_geodist(LAIdat,mof_sen,cvfolds = nndm_folds$indx_test,cvtrain = nndm_folds$indx_train)
ctrl <- trainControl(method="cv",
index=nndm_folds$indx_train,
indexOut = nndm_folds$indx_test,
savePredictions = "all")
model <- ffs(trainDat[,predictors],
trainDat$LAI,
method="rf",
trControl = ctrl,
importance=TRUE,
verbose=FALSE)
model
## Random Forest
##
## 67 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 26, 26, 26, 26, 50, 26, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.7289130 NaN 0.7289130
## 3 0.7321233 NaN 0.7321233
## 4 0.7317000 NaN 0.7317000
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
Let’s then use the trained model for prediction.
LAIprediction <- predict(mof_sen,model)
plot(LAIprediction)
mapview(LAIprediction)
Question?! Why does it look so different than the linear model?
AOA <- aoa(mof_sen,model)
spplot(LAIprediction)+
spplot(AOA$AOA,col.regions=c("black","transparent"))